Text Segmentation with Multiple Surface Linguistic Cues

Authors

  • Hajime Mochizuki
  • Takeo Honda
  • Manabu Okumura
Abstract

A certain range of sentences in a text is widely assumed to form a coherent unit, called a discourse segment. Identifying the segment boundaries is a first step toward recognizing the structure of a text. In this paper, we describe a method for identifying segment boundaries in a Japanese text with the aid of multiple surface linguistic cues, although our experiments may be small-scale. We also present a method for automatically training the weights for multiple linguistic cues while avoiding the overfitting problem.

1 Introduction

A text consists of multiple sentences that have semantic relations with each other. They form semantic units, usually called discourse segments. The global discourse structure of a text can be constructed by relating the discourse segments to each other. Therefore, identifying segment boundaries in a text is considered a first step in constructing the discourse structure (Grosz and Sidner, 1986).

The use of surface linguistic cues in a text for identifying segment boundaries has been extensively researched, since it is impractical to assume the use of world knowledge for discourse analysis of real texts. Among the variety of surface cues, lexical cohesion (Halliday and Hasan, 1976), the surface relationship among words that are semantically similar, has recently received much attention and has been widely used for text segmentation (Morris and Hirst, 1991; Kozima, 1993; Hearst, 1994; Okumura and Honda, 1994). Okumura and Honda (1994) found that lexical cohesion information alone is not enough and that incorporating other surface information may improve accuracy.

In this paper, we describe a method for identifying segment boundaries in a Japanese text with the aid of multiple surface linguistic cues, such as conjunctives, ellipsis, types of sentences, and lexical cohesion. There are a variety of methods for combining multiple knowledge sources (linguistic cues) (McRoy, 1992).
Among them, a weighted sum of the scores for all cues, reflecting each cue's contribution to identifying the correct segment boundaries, is often used as the overall measure for ranking possible segment boundaries. In past research (Kurohashi and Nagao, 1994; Cohen, 1987), the weights for each cue tended to be determined by intuition or trial and error. Since determining weights by hand is labor-intensive and such weights do not always achieve optimal or even near-optimal performance (Rayner et al., 1994), we think it is better to determine the weights automatically, both to avoid the need for expert hand tuning and to achieve performance that is at least locally optimal. We begin by assuming the existence of training texts with the correct segment boundaries and use multiple regression analysis to train the weights automatically.

However, automatic weight training has a well-known problem: the weights tend to overfit the training data, and in that case they degrade performance on other texts. Overfitting is considered to be caused by a relatively large number of parameters (linguistic cues) compared with the size of the training data. Furthermore, not all of the linguistic cues are necessarily useful. We therefore optimize the set of cues used for training the weights: we think that if only the useful cues are selected from the entire set, better weights can be obtained. Fortunately, several methods for parameter selection have already been developed for multiple regression analysis, and we use one of them, the stepwise method. We therefore think we can obtain weights for only the useful cues by using multiple regression analysis together with the stepwise method.
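The idea of training cue weights by regression and keeping only the useful cues can be illustrated with a small sketch. It is not the authors' implementation: the cue columns, the toy data, the `min_gain` threshold, and all function names are assumptions made for illustration, and the selection step is a simplified forward-only variant that stops when the drop in residual error is small, rather than the F-test-based stepwise procedure of standard regression packages. It also assumes the cue columns are not collinear.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_weights(rows: Sequence[Sequence[float]], targets: Sequence[float],
                cues: Sequence[int]) -> List[float]:
    """Least-squares weights (intercept first) for the chosen cue columns."""
    design = [[1.0] + [row[j] for j in cues] for row in rows]
    k = len(cues) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear(xtx, xty)


def score(row: Sequence[float], cues: Sequence[int], w: Sequence[float]) -> float:
    """Weighted sum of the selected cue scores for one candidate boundary."""
    return w[0] + sum(wj * row[j] for wj, j in zip(w[1:], cues))


def rss(rows, targets, cues) -> float:
    """Residual sum of squares of the regression restricted to `cues`."""
    w = fit_weights(rows, targets, cues)
    return sum((t - score(row, cues, w)) ** 2 for row, t in zip(rows, targets))


def forward_select(rows, targets, min_gain: float = 0.05) -> List[int]:
    """Greedily add the cue that most reduces the residual error."""
    selected: List[int] = []
    remaining = list(range(len(rows[0])))
    best = rss(rows, targets, selected)
    while remaining:
        trial, cue = min((rss(rows, targets, selected + [j]), j) for j in remaining)
        if best - trial < min_gain:  # assumed stopping rule, not an F-test
            break
        selected.append(cue)
        remaining.remove(cue)
        best = trial
    return selected


# Toy training data (invented): one row per candidate boundary, with three
# hypothetical binary cues; the target is 1 at a true segment boundary.
ROWS = [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1),
        (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
TARGETS = [0, 0, 0, 1, 0, 0, 0, 1]

selected = forward_select(ROWS, TARGETS)
weights = fit_weights(ROWS, TARGETS, selected)
ranked = sorted(range(len(ROWS)), key=lambda i: -score(ROWS[i], selected, weights))
print(selected, weights, ranked[:2])
```

In this toy setting the third cue carries no information about the boundaries, so the selection step leaves it out and the remaining weights are fitted on two cues only. Candidate boundaries are then ranked by the weighted sum of the selected cue scores.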
To support the claims summarized below, we carry out some preliminary experiments to show the effectiveness of our approach, although our experiments may be small-scale.

  • Combining multiple surface cues is effective for text segmentation.
  • Multiple regression analysis with the stepwise method is good for selecting the cues that are useful for text segmentation and for weighting these cues automatically.

In section two we outline the surface linguistic cues that we use for text segmentation. In section three

Similar resources

Automatic Tracking of Obsolescent Segments with Linguistic Cues

This paper deals with the description and the automatic tracking of text segments containing obsolescence in encyclopedia texts. We assume that despite the non-linguistic nature of this phenomenon, discursive cues are relevant to track those segments. For that purpose, we have worked on a corpus which has been manually annotated by experts and on which we have projected automatically tracked cu...

Discourse Segmentation by Human and Automated Means

The need to model the relation between discourse structure and linguistic features of utterances is almost universally acknowledged in the literature on discourse. However, there is only weak consensus on what the units of discourse structure are, or the criteria for recognizing and generating them. We present quantitative results of a two-part study using a corpus of spontaneous, narrative mon...

Discourse Segmentation of Multi-Party Conversation

We present a domain-independent topic segmentation algorithm for multi-party speech. Our feature-based algorithm combines knowledge about content using a text-based algorithm as a feature and about form using linguistic and acoustic cues about topic shifts extracted from speech. This segmentation algorithm uses automatically induced decision rules to combine the different features. The embedded...

Topic Segmentation : A First Stage to Dialog-Based Information Extraction

We study the problem of topic segmentation of manually transcribed speech in order to facilitate information extraction from dialogs. Our approach is based on a combination of multi-source knowledge modeled by hidden Markov models. We experiment with different combinations of linguistic-level cues on dialogs dealing with search and rescue missions. Results show the effectiveness of multi-source...

Statistical Aggregation and Hypothesis Testing Mechanisms Interact during Word Learning

Referential utterances are by their nature ambiguous to novice language learners. Each utterance consists of multiple layers of information that must be decoded: 1) the linguistic structure (how the sounds and words should be packaged into meaningful units), 2) the world structure (how people, objects and actions in the world relate to one another and which is the current focus of attention) an...

Publication date: 1998